ML Workbench Sample --- Image Classification



Introduction to ML Workbench

ML Workbench provides an easy command-line interface for the machine learning life cycle, which involves four stages:

  • analyze: gathers statistics and metadata about the training data, such as numeric stats, vocabularies, etc. The analysis results are used to transform raw data into numeric features that training can consume directly.
  • transform: explicitly transforms raw data into numeric features that can be used for training.
  • train: trains a model using the transformed data.
  • predict/batch_predict: makes predictions instantly for a few instances of prediction data, or in a batched fashion for a large number of instances.

There are "local" and "cloud" run mode for each stage. "cloud" run mode is recommended if your data is big.

ML Workbench supports numeric, categorical, text, and image training data. For each type, there is a set of "transforms" to choose from. The "transforms" indicate how to convert the data into numeric features. Images, for example, are converted to fixed-size vectors representing high-level features.
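The image_to_vec transform relies on a pretrained network with its classification head removed, so every image maps to a fixed-size embedding. Below is a minimal sketch of that idea using the tf.keras Inception-v3 application; it is only an illustration, not the code ML Workbench runs internally, and it assumes a TensorFlow installation that provides tf.keras.applications.

import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Pretrained Inception v3 without its classification head; global average pooling
# yields one fixed-size (2048-dimensional) vector per image.
model = InceptionV3(weights='imagenet', include_top=False, pooling='avg')

image = np.random.uniform(0, 255, size=(1, 299, 299, 3)).astype('float32')  # a dummy image
embedding = model.predict(preprocess_input(image))
print(embedding.shape)  # (1, 2048)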

Transfer learning using Inception Package - Cloud Run Experience With Large Data

ML Workbench supports image transforms (image to vec) with transfer learning.

This notebook codifies the capabilities discussed in this blog post. In a nutshell, it uses the pre-trained Inception model as a starting point and then uses transfer learning to train it further on additional, customer-specific images. For explanation, simple flower images are used. Compared to training from scratch, the time and costs are drastically reduced.

This notebook does preprocessing, training, and prediction by calling the Cloud ML APIs instead of running them "locally" in the Datalab container. It uses the full dataset.


In [3]:
# ML Workbench magics (%%ml) are under the google.datalab.contrib namespace. They are not enabled by default, so import them before use.
import google.datalab.contrib.mlworkbench.commands


Setup


In [2]:
# Create a temporary GCS bucket. If the bucket already exists and you don't have permission to it, use a different bucket name.
!gsutil mb gs://flower-datalab-demo-bucket-large-data


Creating gs://flower-datalab-demo-bucket-large-data/...

In the next cell, we create a dataset representing our training data.


In [5]:
%%ml dataset create
name: flower_data_full
format: csv
train: gs://cloud-datalab/sampledata/flower/train3000.csv
eval: gs://cloud-datalab/sampledata/flower/eval670.csv
schema:
    - name: image_url
      type: STRING
    - name: label
      type: STRING


Analyze

The analysis step includes computing numeric stats (e.g. min/max), categorical classes, text vocabulary and frequency, etc. Run "%%ml analyze --help" for usage. The analysis results are used to transform raw data into numeric features that the model can deal with; for example, a categorical value is converted to a one-hot vector ("Monday" becomes [1, 0, 0, 0, 0, 0, 0]). The data may be very large, so sometimes a cloud run is needed by adding the --cloud flag. A cloud run starts BigQuery jobs, which may incur some costs.

In this case, the analysis step only collects the unique labels.

Note that we run analysis only on the training data, not the evaluation data.
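As a generic illustration (not part of this flower pipeline), here is a minimal sketch of how a vocabulary produced by analysis turns a categorical value into the one-hot vector mentioned above:

# Minimal one-hot encoding from a computed vocabulary (illustration only).
vocab = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

def one_hot(value, vocab):
    return [1 if value == v else 0 for v in vocab]

print(one_hot('Monday', vocab))  # [1, 0, 0, 0, 0, 0, 0]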


In [6]:
%%ml analyze --cloud
output: gs://flower-datalab-demo-bucket-large-data/analysis
data: flower_data_full
features:
    image_url:
        transform: image_to_vec
    label:
        transform: target


Analyzing column image_url...
column image_url analyzed.
Analyzing column label...
column label analyzed.
Updated property [core/project].

In [7]:
# Check analysis results
!gsutil list gs://flower-datalab-demo-bucket-large-data/analysis


gs://flower-datalab-demo-bucket-large-data/analysis/features.json
gs://flower-datalab-demo-bucket-large-data/analysis/schema.json
gs://flower-datalab-demo-bucket-large-data/analysis/stats.json
gs://flower-datalab-demo-bucket-large-data/analysis/vocab_label.csv

Transform

With the analysis results we can transform raw data into numeric features. This needs to be done for both training and eval data. The data may be very large, so sometimes a cloud pipeline is needed by adding --cloud. A cloud run is implemented with Dataflow jobs, so it may incur some costs.

In this case, the transform is required. It downloads each image, resizes it, and generates an embedding from it by running a pretrained TensorFlow graph. Note that it creates two jobs --- one for the training data and one for the eval data.
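Conceptually, the per-image work of the transform looks like the sketch below. This is an illustration only, with a hypothetical helper name, and assumes TensorFlow 1.x (tf.gfile handles both local and gs:// paths) plus PIL; it is not the actual ML Workbench implementation.

import io
import numpy as np
from PIL import Image
import tensorflow as tf

def image_to_embedding(image_path, embedding_fn, size=(299, 299)):
    # Download the image bytes, resize to the network's input size, and
    # compute the embedding with a pretrained graph (e.g. the Inception
    # model from the earlier sketch).
    with tf.gfile.GFile(image_path, 'rb') as f:
        image = Image.open(io.BytesIO(f.read())).convert('RGB').resize(size)
    pixels = np.asarray(image, dtype=np.float32)[np.newaxis, ...]
    return embedding_fn(pixels)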


In [ ]:
# Remove previous results
!gsutil -m rm -r gs://flower-datalab-demo-bucket-large-data/transform

In [12]:
%%ml transform --cloud
analysis: gs://flower-datalab-demo-bucket-large-data/analysis
output: gs://flower-datalab-demo-bucket-large-data/transform
data: flower_data_full


/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py:113: DeprecationWarning: object() takes no parameters
  super(GcsIO, cls).__new__(cls, storage_client))
/usr/local/lib/python2.7/dist-packages/apache_beam/coders/typecoders.py:135: UserWarning: Using fallback coder for typehint: Any.
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)
running sdist
running egg_info
creating trainer.egg-info
writing requirements to trainer.egg-info/requires.txt
writing trainer.egg-info/PKG-INFO
writing top-level names to trainer.egg-info/top_level.txt
writing dependency_links to trainer.egg-info/dependency_links.txt
writing manifest file 'trainer.egg-info/SOURCES.txt'
reading manifest file 'trainer.egg-info/SOURCES.txt'
writing manifest file 'trainer.egg-info/SOURCES.txt'
warning: sdist: standard file not found: should have one of README, README.rst, README.txt, README.md

running check
warning: check: missing required meta-data: url

creating trainer-1.0.0
creating trainer-1.0.0/trainer
creating trainer-1.0.0/trainer.egg-info
copying files to trainer-1.0.0...
copying setup.py -> trainer-1.0.0
copying trainer/__init__.py -> trainer-1.0.0/trainer
copying trainer/feature_analysis.py -> trainer-1.0.0/trainer
copying trainer/feature_transforms.py -> trainer-1.0.0/trainer
copying trainer/task.py -> trainer-1.0.0/trainer
copying trainer.egg-info/PKG-INFO -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/SOURCES.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/dependency_links.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/requires.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/top_level.txt -> trainer-1.0.0/trainer.egg-info
Writing trainer-1.0.0/setup.cfg
Creating tar archive
removing 'trainer-1.0.0' (and everything under it)
DEPRECATION: pip install --download has been deprecated and will be removed in the future. Pip now has a download command that should be used instead.
Collecting google-cloud-dataflow==2.0.0
  Downloading google-cloud-dataflow-2.0.0.tar.gz (576kB)
  Saved /tmp/tmp3nmHby/google-cloud-dataflow-2.0.0.tar.gz
Successfully downloaded google-cloud-dataflow
View job at https://console.developers.google.com/dataflow/job/2017-10-18_10_00_29-3270505292889461844?project=bradley-playground
/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py:113: DeprecationWarning: object() takes no parameters
  super(GcsIO, cls).__new__(cls, storage_client))
/usr/local/lib/python2.7/dist-packages/apache_beam/coders/typecoders.py:135: UserWarning: Using fallback coder for typehint: Any.
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)
running sdist
running egg_info
writing requirements to trainer.egg-info/requires.txt
writing trainer.egg-info/PKG-INFO
writing top-level names to trainer.egg-info/top_level.txt
writing dependency_links to trainer.egg-info/dependency_links.txt
reading manifest file 'trainer.egg-info/SOURCES.txt'
writing manifest file 'trainer.egg-info/SOURCES.txt'
warning: sdist: standard file not found: should have one of README, README.rst, README.txt, README.md

running check
warning: check: missing required meta-data: url

creating trainer-1.0.0
creating trainer-1.0.0/trainer
creating trainer-1.0.0/trainer.egg-info
copying files to trainer-1.0.0...
copying setup.py -> trainer-1.0.0
copying trainer/__init__.py -> trainer-1.0.0/trainer
copying trainer/feature_analysis.py -> trainer-1.0.0/trainer
copying trainer/feature_transforms.py -> trainer-1.0.0/trainer
copying trainer/task.py -> trainer-1.0.0/trainer
copying trainer.egg-info/PKG-INFO -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/SOURCES.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/dependency_links.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/requires.txt -> trainer-1.0.0/trainer.egg-info
copying trainer.egg-info/top_level.txt -> trainer-1.0.0/trainer.egg-info
Writing trainer-1.0.0/setup.cfg
Creating tar archive
removing 'trainer-1.0.0' (and everything under it)
DEPRECATION: pip install --download has been deprecated and will be removed in the future. Pip now has a download command that should be used instead.
Collecting google-cloud-dataflow==2.0.0
  Using cached google-cloud-dataflow-2.0.0.tar.gz
  Saved /tmp/tmp5Syw_T/google-cloud-dataflow-2.0.0.tar.gz
Successfully downloaded google-cloud-dataflow
View job at https://console.developers.google.com/dataflow/job/2017-10-18_10_00_39-10389677604893726558?project=bradley-playground

After the transformation is done, create a new dataset referencing the transformed data.


In [15]:
%%ml dataset create
name: flower_data_full_transformed
format: transformed
train: gs://flower-datalab-demo-bucket-large-data/transform/train-*
eval: gs://flower-datalab-demo-bucket-large-data/transform/eval-*


Train

Training starts from the transformed data. If the training workload is too heavy for the local VM, --cloud is recommended so training happens in the cloud, in a distributed way. Run %%ml train --help for details.

Training in the cloud is implemented with Cloud ML Engine. It may incur some costs.
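For intuition, the dnn_classification model with hidden-layer-size1: 100 (used in the cell below) is roughly a small DNN on top of the 2048-dimensional image embeddings, with one output per flower label. The following sketch shows that shape using tf.keras; it is an assumption-laden illustration, not the actual ML Workbench trainer code.

# Rough structural sketch of the classifier head (illustration only).
from tensorflow.keras import layers, models

sketch = models.Sequential([
    layers.Dense(100, activation='relu', input_shape=(2048,)),  # hidden-layer-size1: 100
    layers.Dense(5, activation='softmax'),                      # one unit per flower label
])
sketch.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
sketch.summary()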


In [ ]:
# Remove previous training results.
!gsutil -m rm -r gs://flower-datalab-demo-bucket-large-data/train

In [16]:
%%ml train --cloud
output: gs://flower-datalab-demo-bucket-large-data/train
analysis: gs://flower-datalab-demo-bucket-large-data/analysis
data: flower_data_full_transformed
model_args:
    model: dnn_classification
    hidden-layer-size1: 100
    top-n: 0
cloud_config:
    region: us-central1
    scale_tier: BASIC


Job "trainer_task_171019_051802" submitted.

Click here to view cloud log.

TensorBoard was started successfully with pid 17594. Click here to access it.

After training is complete, you should see model files like the following.


In [17]:
# List the model files
!gsutil list gs://flower-datalab-demo-bucket-large-data/train/model


gs://flower-datalab-demo-bucket-large-data/train/model/
gs://flower-datalab-demo-bucket-large-data/train/model/saved_model.pb
gs://flower-datalab-demo-bucket-large-data/train/model/assets.extra/
gs://flower-datalab-demo-bucket-large-data/train/model/variables/
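As an optional sanity check (a sketch, not a required step of this workflow), the exported SavedModel's serving signatures can be inspected locally, assuming TensorFlow 1.x is available in this environment:

# Hypothetical inspection of the exported model (copies it locally first).
!mkdir -p /tmp/flower_model
!gsutil -m cp -r gs://flower-datalab-demo-bucket-large-data/train/model /tmp/flower_model/

import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    meta_graph = tf.saved_model.loader.load(
        sess, [tf.saved_model.tag_constants.SERVING], '/tmp/flower_model/model')
    # Print each signature name with its input and output tensor keys.
    for name, sig in meta_graph.signature_def.items():
        print(name, list(sig.inputs.keys()), list(sig.outputs.keys()))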

Batch Prediction

Batch prediction performs prediction in a batched fashion. The data can be large and is specified by files.

Note that we use the "evaluation_model", which sits in the "evaluation_model" directory. Training creates two models: a regular model under the "model" directory and an evaluation model under "evaluation_model". The difference is that the regular model takes prediction data without the target, while the evaluation model takes data with the target and outputs the target as-is. The evaluation model is therefore well suited for evaluating model quality, because both the targets and the predicted values are included in the output.


In [18]:
%%ml batch_predict --cloud
model: gs://flower-datalab-demo-bucket-large-data/train/evaluation_model
output: gs://flower-datalab-demo-bucket-large-data/evaluation
cloud_config:
    region: us-central1
data:
    csv: gs://cloud-datalab/sampledata/flower/eval670.csv


Job "prediction_171019_054302" submitted.

Click here to view cloud log.


In [19]:
# after prediction is done, check the output
!gsutil list -l -h gs://flower-datalab-demo-bucket-large-data/evaluation


       0 B  2017-10-19T05:51:38Z  gs://flower-datalab-demo-bucket-large-data/evaluation/prediction.errors_stats-00000-of-00001
       0 B  2017-10-19T04:26:01Z  gs://flower-datalab-demo-bucket-large-data/evaluation/prediction.results-00000-of-00001
  9.91 KiB  2017-10-19T05:51:38Z  gs://flower-datalab-demo-bucket-large-data/evaluation/prediction.results-00000-of-00006
 21.86 KiB  2017-10-19T05:51:38Z  gs://flower-datalab-demo-bucket-large-data/evaluation/prediction.results-00001-of-00006
  9.35 KiB  2017-10-19T05:51:38Z  gs://flower-datalab-demo-bucket-large-data/evaluation/prediction.results-00002-of-00006
 17.92 KiB  2017-10-19T05:51:38Z  gs://flower-datalab-demo-bucket-large-data/evaluation/prediction.results-00003-of-00006
 60.27 KiB  2017-10-19T05:51:38Z  gs://flower-datalab-demo-bucket-large-data/evaluation/prediction.results-00004-of-00006
 13.68 KiB  2017-10-19T05:51:38Z  gs://flower-datalab-demo-bucket-large-data/evaluation/prediction.results-00005-of-00006
TOTAL: 8 objects, 136179 bytes (132.99 KiB)

In [21]:
# Take a look at the file.
!gsutil cat -r -500 gs://flower-datalab-demo-bucket-large-data/evaluation/prediction.results-00000-of-00006


 "sunflowers": 0.023919522762298584, "predicted": "daisy", "daisy": 0.5709955096244812}
{"target": "sunflowers", "dandelion": 6.407364847637394e-16, "tulips": 1.1489890098563176e-21, "roses": 3.703789355800949e-23, "sunflowers": 1.0, "predicted": "sunflowers", "daisy": 4.062632231285804e-18}
{"target": "sunflowers", "dandelion": 3.1532248600391055e-12, "tulips": 1.2175611572835896e-15, "roses": 1.7113922181424834e-16, "sunflowers": 1.0, "predicted": "sunflowers", "daisy": 6.292334708835057e-12}

Prediction results are in JSON format. We can load the results into a BigQuery table and perform analysis.
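Before loading into BigQuery, the results can also be sanity-checked locally. The sketch below is not part of the original pipeline; it assumes the result files fit in memory, a writable /tmp directory, and a pandas version that can read newline-delimited JSON.

# Hypothetical local check of the batch-prediction output.
!mkdir -p /tmp/flower_eval
!gsutil -m cp gs://flower-datalab-demo-bucket-large-data/evaluation/prediction.results-* /tmp/flower_eval/

import glob
import pandas as pd

# Each file contains one JSON object per line, as shown above.
frames = [pd.read_json(path, lines=True) for path in glob.glob('/tmp/flower_eval/prediction.results-*')]
results = pd.concat(frames, ignore_index=True)
print('overall accuracy:', (results['predicted'] == results['target']).mean())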


In [23]:
import google.datalab.bigquery as bq

schema = [
  {'name': 'predicted', 'type': 'STRING'},
  {'name': 'target', 'type': 'STRING'},
  {'name': 'daisy', 'type': 'FLOAT'},  
  {'name': 'dandelion', 'type': 'FLOAT'},  
  {'name': 'roses', 'type': 'FLOAT'},
  {'name': 'sunflowers', 'type': 'FLOAT'},
  {'name': 'tulips', 'type': 'FLOAT'},
]

bq.Dataset('image_classification_results').create()
t = bq.Table('image_classification_results.flower').create(schema = schema, overwrite = True)
t.load('gs://flower-datalab-demo-bucket-large-data/evaluation/prediction.results-*', mode='overwrite', source_format='json')


Out[23]:
Job bradley-playground/job_6U6Mnzk1o-KCAdLob2PwMs4_0lWk completed

Check wrong predictions.


In [24]:
%%bq query
SELECT * FROM image_classification_results.flower WHERE predicted != target


Out[24]:
predicted   target  daisy              dandelion          roses              sunflowers         tulips
roses       daisy   0.000107081861643  4.2452506932e-06   0.985331714153     0.01447784435      7.90482226876e-05
roses       daisy   0.00336904148571   0.0817719474435    0.898831009865     0.00606627622619   0.00996168423444
roses       daisy   9.25164749788e-06  3.04345193491e-09  0.999936938286     1.34516176331e-06  5.23670387338e-05
roses       daisy   6.38103125894e-09  1.10664151454e-11  0.999925494194     2.96761953678e-07  7.42329939385e-05
tulips      daisy   4.83075413005e-10  4.30522035799e-08  1.14557883535e-06  3.37842357112e-06  0.999995470047
dandelion   daisy   8.15960166038e-19  1.0                3.36299384949e-26  7.30872310153e-27  3.52697304699e-24
dandelion   daisy   0.0406570062041    0.959276914597     2.83383229771e-07  4.56821435364e-05  2.0248007786e-05
dandelion   daisy   3.2681695302e-05   0.999963641167     3.94953673322e-07  3.09938968712e-06  2.75721021126e-07
dandelion   daisy   5.03894037607e-10  1.0                9.46683408883e-16  2.09550150669e-13  4.55575959173e-14
dandelion   daisy   0.139625295997     0.859617710114     1.33036223815e-06  0.000684840255417  7.08434599801e-05
sunflowers  daisy   7.75286750354e-08  3.92487065071e-09  0.000105055529275  0.999894857407     3.0220762004e-10
daisy       roses   0.990616858006     2.06143031392e-05  4.30926206718e-07  0.00933849997818   2.35830357269e-05
tulips      roses   7.71652697296e-09  4.74015735108e-07  2.31539361266e-05  5.86078713241e-05  0.99991774559
tulips      roses   2.72695858949e-11  1.79968728808e-09  5.24185139739e-08  3.03490395481e-07  0.999999642372
tulips      roses   3.47334845347e-13  4.56423744286e-13  0.348514556885     1.56286994457e-09  0.651485443115
tulips      roses   0.000177768699359  0.000231440324569  0.00193378305994   0.00993802398443   0.987718939781
tulips      roses   7.39195138522e-06  1.26397480926e-06  0.00481663364917   3.03236829495e-05  0.995144307613
tulips      roses   3.43440172135e-11  1.02370098509e-10  8.96404444006e-08  1.49483753376e-07  0.999999761581
tulips      roses   3.0772439629e-16   9.47765824155e-12  2.44594317023e-10  1.03360986436e-09  1.0
tulips      roses   1.13629557152e-10  9.97475257947e-09  0.0632723867893    4.08878668168e-06  0.936723470688
tulips      roses   8.4810232635e-18   8.01686787701e-13  1.48080445683e-08  5.61062932225e-11  1.0
tulips      roses   4.21678425511e-09  6.56204051097e-08  6.52195303701e-05  1.43616239257e-06  0.999933242798
tulips      roses   1.03960369202e-08  2.06416939363e-06  3.89039882975e-07  7.15018468327e-05  0.99992609024
tulips      roses   1.07436523724e-18  1.40168570652e-14  3.7043668133e-09   8.53360611341e-12  1.0
dandelion   roses   0.0225269664079    0.976762115955     7.62510410368e-08  0.000674103968777  3.66673339158e-05

(rows: 63, time: 1.1s, 39KB processed, job: job_dUyxZRewAJ9ddEEFxDaNbqRGNHGU)



In [26]:
%%ml evaluate confusion_matrix --plot
bigquery: image_classification_results.flower



In [27]:
%%ml evaluate accuracy
bigquery: image_classification_results.flower


Out[27]:
target accuracy count
0 daisy 0.909836 122
1 roses 0.857143 119
2 tulips 0.884615 130
3 dandelion 0.925926 162
4 sunflowers 0.941606 137
5 _all 0.905970 670

Online Prediction and Build Your Own Prediction Client

Please see "Flower Classification (small dataset experience)" notebook for how to deploy the trained model and build your own prediction client.

Cleanup


In [ ]:
!gsutil -m rm -rf gs://flower-datalab-demo-bucket-large-data

In [ ]: